{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20678ca7",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install graphdatascience"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "610eb05c",
   "metadata": {},
   "source": [
    "# Analyzing the evolution of life on Earth with Neo4j\n",
    "Explore the NCBI taxonomy of organisms in a graph database\n",
    "\n",
    "The evolution of life is a beautiful and insightful field of study that traces our origins back to the beginning of life. It helps us understand where we came from and where we are potentially going. The relationships between species are often depicted in the tree of life, a model that describes how various species are related. Since a tree structure is a form of a graph, it makes sense to store those relationships in a graph database, where they can be analyzed and visualized.\n",
    "\n",
    "In this blog post, I have decided to import the NCBI taxonomy of organisms into Neo4j, a graph database, where we can easily traverse and analyze relationships between various species.\n",
    "# Environment and dataset setup\n",
    "To follow the code examples in this post, you will need to download the Neo4j Desktop application. I have prepared a [database dump](https://drive.google.com/file/d/1-TNOU3KKEaDH6AtXJRQxzy8yRimx41Zt/view?usp=sharing) that you can use to get the Neo4j database up and running without having to import the dataset yourself. Take a look at my [previous blog post](https://tbgraph.wordpress.com/2020/11/11/dump-and-load-a-database-in-neo4j-desktop/) if you need some help with restoring the database dump.\n",
    "\n",
    "The original dataset is available on the NCBI website. I have used the new tax dump folder downloaded on 13th June 2022 to create the above database dump. While no explicit license is specified for the dataset, the NCBI website states that all information is available within the public domain.\n",
    "I have made available the code used to import the taxonomy into Neo4j on my GitHub if you want to evaluate the process or make any changes.\n",
    "# Graph schema\n",
    "I have imported the following files into Neo4j:\n",
    "* nodes.dmp\n",
    "* names.dmp\n",
    "* host.dmp\n",
    "* citations.dmp\n",
    "\n",
    "Some of the other files contain redundant information that is already present in the nodes.dmp file, which holds the taxonomy of organisms. I have looked a bit at the genetic code files, but since I have no idea what to do with genetic code names and their translations, I have skipped them during import.\n",
    "\n",
    "I have added a generic label `Node` to all nodes present in the nodes.dmp file. These nodes contain multiple properties that can be used to import other files and help experts better analyze the dataset. For us, only the `name` property will be relevant. The taxonomy hierarchy is represented with the `PARENT` relationship between nodes. The dataset also contains a file that describes potential hosts of various species, which are represented as `POTENTIAL_HOST` relationships. Lastly, some of the nodes are mentioned in various medical sources, which are represented as the `Citation` nodes.\n",
    "\n",
    "All the nodes with the generic label `Node` have a secondary label that describes their rank. Some examples of ranks are `Species`, `Family`, and `Genus`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a6a566b3",
   "metadata": {},
   "outputs": [],
   "source": [
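    "from graphdatascience import GraphDataScience\n",
    "\n",
    "# Connection details used when running this notebook; replace them with those of your own database\n",
    "host = \"bolt://44.193.28.203:7687\"\n",
    "user = \"neo4j\"\n",
    "password = \"combatants-coordinates-tugs\"\n",
    "# Open a connection through the Graph Data Science Python client\n",
    "gds = GraphDataScience(host, auth=(user, password))"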
   ]
  },
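  {
   "cell_type": "markdown",
   "id": "c4f2b8d6",
   "metadata": {},
   "source": [
    "The connection details above are hardcoded for convenience. If you would rather keep them out of the notebook, here is a minimal sketch that reads them from environment variables. The variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `neo4j_settings` are my own assumptions for illustration, not part of the original setup:\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "def neo4j_settings(env=None):\n",
    "    # Hypothetical variable names; adjust them to your own setup\n",
    "    env = os.environ if env is None else env\n",
    "    uri = env.get('NEO4J_URI', 'bolt://localhost:7687')\n",
    "    user = env.get('NEO4J_USER', 'neo4j')\n",
    "    password = env['NEO4J_PASSWORD']  # no default, so a missing password fails loudly\n",
    "    return uri, (user, password)\n",
    "\n",
    "# Usage (needs a running database):\n",
    "# from graphdatascience import GraphDataScience\n",
    "# uri, auth = neo4j_settings()\n",
    "# gds = GraphDataScience(uri, auth=auth)\n",
    "```"
   ]
  },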
  {
   "cell_type": "markdown",
   "id": "48aff40a",
   "metadata": {},
   "source": [
    "# Exploratory analysis\n",
    "\n",
    "I looked for the Homo sapiens species in the dataset but couldn't find it. Interestingly, the folks at NCBI decided to name our species simply human. We can examine the taxonomy neighborhood up to three hops with the following Cypher statement:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "e6a10b85",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>result</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[human, Neanderthal man]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[human, Homo sp. Altai]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[human, humans]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[human, humans, unclassified Homo]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[human, humans, unclassified Homo, Homo sp.]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>[human, humans, environmental samples]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>[human, humans, environmental samples, Homo sa...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>[human, humans, Homo heidelbergensis]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>[human, humans, Homo/Pan/Gorilla group]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>[human, humans, Homo/Pan/Gorilla group, Pongidae]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>[human, humans, Homo/Pan/Gorilla group, Gorilla]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>[human, humans, Homo/Pan/Gorilla group, Pan sp...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               result\n",
       "0                            [human, Neanderthal man]\n",
       "1                             [human, Homo sp. Altai]\n",
       "2                                     [human, humans]\n",
       "3                  [human, humans, unclassified Homo]\n",
       "4        [human, humans, unclassified Homo, Homo sp.]\n",
       "5              [human, humans, environmental samples]\n",
       "6   [human, humans, environmental samples, Homo sa...\n",
       "7               [human, humans, Homo heidelbergensis]\n",
       "8             [human, humans, Homo/Pan/Gorilla group]\n",
       "9   [human, humans, Homo/Pan/Gorilla group, Pongidae]\n",
       "10   [human, humans, Homo/Pan/Gorilla group, Gorilla]\n",
       "11  [human, humans, Homo/Pan/Gorilla group, Pan sp..."
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gds.run_cypher(\"\"\"\n",
    "MATCH p=(n:Node {name:\"human\"})-[:PARENT*..3]-()\n",
    "RETURN [n in nodes(p) | n.name] AS result\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c27e0790",
   "metadata": {},
   "source": [
    "So, the human node is a species that belongs to the humans genus, which is part of the Pongidae family. After a quick Google search, it seems that the Pongidae taxon is obsolete and Hominidae should be used instead. The next level up, the superfamily, is represented in the NCBI taxonomy as Hominoidea. Interestingly, the human species has two subspecies, namely Neanderthals and Denisovans, the latter represented by the Homo sp. Altai node. I just learned something new about our history.\n",
    "\n",
    "The NCBI taxonomy dataset contains only about 10% of the described species of life on the planet, so don't be surprised if some species are missing from the dataset.\n",
    "\n",
    "Let's examine how many species there are in the dataset with the following Cypher statement:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "efba3a88",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>speciesCount</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1981376</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   speciesCount\n",
       "0       1981376"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gds.run_cypher(\"\"\"\n",
    "MATCH (s:Species)\n",
    "RETURN count(s) AS speciesCount\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b997ddf4",
   "metadata": {},
   "source": [
    "There are almost two million species described in the dataset, which means there is plenty of room to explore.\n",
    "Next, we can examine the taxonomy hierarchy for the human species all the way to the root of the tree using a simple query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "ca617f6e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>lineage</th>\n",
       "      <th>rank</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>human</td>\n",
       "      <td>Species</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>humans</td>\n",
       "      <td>Genus</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Homo/Pan/Gorilla group</td>\n",
       "      <td>Subfamily</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Pongidae</td>\n",
       "      <td>Family</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Hominoidea</td>\n",
       "      <td>Superfamily</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Catarrhini</td>\n",
       "      <td>Parvorder</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Simiiformes</td>\n",
       "      <td>Infraorder</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Haplorrhini</td>\n",
       "      <td>Suborder</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Primates</td>\n",
       "      <td>Order</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Euarchontoglires</td>\n",
       "      <td>Superorder</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Boreotheria</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>placentals</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Theria</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>mammals</td>\n",
       "      <td>Class</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>amniotes</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>tetrapods</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Dipnotetrapodomorpha</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Sarcopterygii</td>\n",
       "      <td>Superclass</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Euteleostomi</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Teleostomi</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>jawed vertebrates</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>vertebrates</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Craniata</td>\n",
       "      <td>Subphylum</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>chordates</td>\n",
       "      <td>Phylum</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Deuterostomia</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>Bilateria</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>Eumetazoa</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>multicellular animals</td>\n",
       "      <td>Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>opisthokonts</td>\n",
       "      <td>Clade</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>eukaryotes</td>\n",
       "      <td>Superkingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>cellular organisms</td>\n",
       "      <td>No rank</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>root</td>\n",
       "      <td>No rank</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>root</td>\n",
       "      <td>No rank</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   lineage          rank\n",
       "0                    human       Species\n",
       "1                   humans         Genus\n",
       "2   Homo/Pan/Gorilla group     Subfamily\n",
       "3                 Pongidae        Family\n",
       "4               Hominoidea   Superfamily\n",
       "5               Catarrhini     Parvorder\n",
       "6              Simiiformes    Infraorder\n",
       "7              Haplorrhini      Suborder\n",
       "8                 Primates         Order\n",
       "9         Euarchontoglires    Superorder\n",
       "10             Boreotheria         Clade\n",
       "11              placentals         Clade\n",
       "12                  Theria         Clade\n",
       "13                 mammals         Class\n",
       "14                amniotes         Clade\n",
       "15               tetrapods         Clade\n",
       "16    Dipnotetrapodomorpha         Clade\n",
       "17           Sarcopterygii    Superclass\n",
       "18            Euteleostomi         Clade\n",
       "19              Teleostomi         Clade\n",
       "20       jawed vertebrates         Clade\n",
       "21             vertebrates         Clade\n",
       "22                Craniata     Subphylum\n",
       "23               chordates        Phylum\n",
       "24           Deuterostomia         Clade\n",
       "25               Bilateria         Clade\n",
       "26               Eumetazoa         Clade\n",
       "27   multicellular animals       Kingdom\n",
       "28            opisthokonts         Clade\n",
       "29              eukaryotes  Superkingdom\n",
       "30      cellular organisms       No rank\n",
       "31                    root       No rank\n",
       "32                    root       No rank"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gds.run_cypher(\"\"\"\n",
    "MATCH p=(:Node {name:'human'})-[:PARENT*0..]->(parent)\n",
    "RETURN parent.name AS lineage, labels(parent)[1] AS rank\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61b504b9",
   "metadata": {},
   "source": [
    "It seems that 31 traversals are needed to get from the human node to the root node. The root node has a self-loop (a relationship with itself), most likely because the NCBI nodes.dmp file lists the root as its own parent, and that's why it shows up twice in the results. In addition, the clade rank, which denotes a group of organisms that have evolved from a common ancestor, shows up multiple times in the hierarchy. It looks like the NCBI taxonomy is richer than what you would find with a quick Google search.\n",
    "\n",
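    "To make the traversal concrete, here is a minimal pure-Python sketch of what following `PARENT` relationships amounts to. The `parent_of` dictionary is a hypothetical, heavily shortened stand-in for the graph, and its self-referencing `root` entry mimics the self-loop described above:\n",
    "\n",
    "```python\n",
    "# Toy stand-in for the PARENT relationships (hypothetical, shortened lineage)\n",
    "parent_of = {\n",
    "    'human': 'humans',\n",
    "    'humans': 'Primates',\n",
    "    'Primates': 'mammals',\n",
    "    'mammals': 'root',\n",
    "    'root': 'root',  # self-loop, as observed on the root node\n",
    "}\n",
    "\n",
    "def lineage(name, parent_of):\n",
    "    # Follow parent pointers until a node points to itself or has no parent\n",
    "    path = [name]\n",
    "    while True:\n",
    "        parent = parent_of.get(path[-1])\n",
    "        if parent is None or parent == path[-1]:\n",
    "            return path\n",
    "        path.append(parent)\n",
    "\n",
    "hops = len(lineage('human', parent_of)) - 1  # number of PARENT traversals\n",
    "```\n",
    "\n",
    "For this toy map, `lineage('human', parent_of)` ends at `root` after four hops. Unlike the Cypher pattern, the sketch stops at the self-loop, so the root is listed only once.\n",
    "\n",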
    "Graph databases like Neo4j are also great at finding the shortest paths between nodes in the graph. Instead of comparing apples to oranges, we can answer the critical question of how close oranges are to bananas in the taxonomy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "3f94f857",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'name': 'Valencia orange', 'rank': 'Species'},\n",
       " {'name': 'Microcitrus Swingle', 'rank': 'Genus'},\n",
       " {'name': 'Aurantioideae', 'rank': 'Subfamily'},\n",
       " {'name': 'Rutaceae', 'rank': 'Family'},\n",
       " {'name': 'Sapindales', 'rank': 'Order'},\n",
       " {'name': 'malvids', 'rank': 'Clade'},\n",
       " {'name': 'rosids', 'rank': 'Clade'},\n",
       " {'name': 'Pentapetalae', 'rank': 'Clade'},\n",
       " {'name': 'Gunneridae', 'rank': 'Clade'},\n",
       " {'name': 'eudicotyledons', 'rank': 'Clade'},\n",
       " {'name': 'Mesangiospermae', 'rank': 'Clade'},\n",
       " {'name': 'monocotyledons', 'rank': 'Clade'},\n",
       " {'name': 'Petrosaviidae S.W.Graham & W.S.Judd, 2007', 'rank': 'Subclass'},\n",
       " {'name': 'Commeliniflorae', 'rank': 'Clade'},\n",
       " {'name': 'Zingiberiflorae', 'rank': 'Order'},\n",
       " {'name': 'Musaceae', 'rank': 'Family'},\n",
       " {'name': 'Musa', 'rank': 'Genus'},\n",
       " {'name': 'sweet banana', 'rank': 'Species'}]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gds.run_cypher(\"\"\"\n",
    "MATCH (h:Node {name:'Valencia orange'}), (g:Node {name:'sweet banana'})\n",
    "MATCH p=shortestPath( (h)-[:PARENT*]-(g))\n",
    "RETURN [n in nodes(p) | {name: n.name, rank: labels(n)[1]}] AS path\n",
    "\"\"\")['path'][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e197c3e8",
   "metadata": {},
   "source": [
    "It seems that the closest common ancestor of the sweet banana and the Valencia orange is the Mesangiospermae clade. Mesangiospermae is a clade of flowering plants.\n",
    "\n",
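    "In a tree, the shortest path between two nodes always passes through their closest common ancestor, which is the first taxon that the two lineages share. The idea can be sketched in plain Python. The parent map below is a shortened version of the path returned above and is meant for illustration only:\n",
    "\n",
    "```python\n",
    "# Shortened parent map based on the path above (illustrative, not the full taxonomy)\n",
    "parent_of = {\n",
    "    'Valencia orange': 'Rutaceae',\n",
    "    'Rutaceae': 'eudicotyledons',\n",
    "    'eudicotyledons': 'Mesangiospermae',\n",
    "    'sweet banana': 'Musa',\n",
    "    'Musa': 'Musaceae',\n",
    "    'Musaceae': 'monocotyledons',\n",
    "    'monocotyledons': 'Mesangiospermae',\n",
    "}\n",
    "\n",
    "def ancestors(name):\n",
    "    # The node itself followed by all of its ancestors, nearest first\n",
    "    path = [name]\n",
    "    while path[-1] in parent_of:\n",
    "        path.append(parent_of[path[-1]])\n",
    "    return path\n",
    "\n",
    "def closest_common_ancestor(a, b):\n",
    "    seen = set(ancestors(a))\n",
    "    # First ancestor of b that also lies on the lineage of a\n",
    "    return next((n for n in ancestors(b) if n in seen), None)\n",
    "```\n",
    "\n",
    "For this toy map, `closest_common_ancestor('Valencia orange', 'sweet banana')` evaluates to `'Mesangiospermae'`, which matches the turning point of the path returned by the query.\n",
    "\n",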
    "Another use case for traversing relationships could be finding all the organisms in the same family as a particular species. Here, we will list all the genera in the same family as the sweet banana."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "a44238d2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>genus</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Musella</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Ensete</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Musa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     genus\n",
       "0  Musella\n",
       "1   Ensete\n",
       "2     Musa"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gds.run_cypher(\"\"\"\n",
    "MATCH (:Node {name:'sweet banana'})-[:PARENT*0..]->(f:Family)\n",
    "MATCH (f)<-[:PARENT*]-(s:Genus)\n",
    "RETURN s.name AS genus\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d436eb3",
   "metadata": {},
   "source": [
    "Sweet banana belongs to the Musa genus and the Musaceae family. Interestingly, there is a Musella genus, which sounds like a small Musa. In fact, after googling the Musella genus, it looks like it contains only a single species, commonly referred to as the Chinese dwarf banana.\n",
    "\n",
    "# Inference with Neo4j\n",
    "In the last example, we will look at how to develop inference queries in Neo4j. Inference means we create new relationships based on a set of rules between nodes and either store them in the database or use them at query time only. Here, I will show you an example of inference queries that use new relationships only at query time when analyzing potential hosts.\n",
    "\n",
    "First, we will evaluate which organisms have the most described potential parasites in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "bff8309f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>organism</th>\n",
       "      <th>rank</th>\n",
       "      <th>potentialParasites</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>vertebrates</td>\n",
       "      <td>Clade</td>\n",
       "      <td>175285</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>human</td>\n",
       "      <td>Species</td>\n",
       "      <td>169891</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>plants</td>\n",
       "      <td>Clade</td>\n",
       "      <td>51</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Azorhizobium</td>\n",
       "      <td>Genus</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>primary endosymbiont of Schizaphis graminum</td>\n",
       "      <td>Species</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      organism     rank  potentialParasites\n",
       "0                                  vertebrates    Clade              175285\n",
       "1                                        human  Species              169891\n",
       "2                                       plants    Clade                  51\n",
       "3                                 Azorhizobium    Genus                   0\n",
       "4  primary endosymbiont of Schizaphis graminum  Species                   0"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gds.run_cypher(\"\"\"\n",
    "MATCH (n:Node)\n",
    "RETURN n.name AS organism,\n",
    "       labels(n)[1] AS rank,\n",
    "       count{ (n)<-[:POTENTIAL_HOST]-() } AS potentialParasites\n",
    "ORDER BY potentialParasites DESC\n",
    "LIMIT 5\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df81b974",
   "metadata": {},
   "source": [
    "It seems that human is the only species with described potential parasites. The other two hosts, vertebrates and plants, are clades, and no other node has any. I would venture a guess that most if not all of the potential parasites for humans are also potential parasites for vertebrates since the counts are so close.\n",
    "\n",
    "We can check how many potential hosts organisms have with the following Cypher statement."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "86546160",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ph</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>18359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>163434</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ph   count\n",
       "0   1   18359\n",
       "1   2  163434"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gds.run_cypher(\"\"\"\n",
    "MATCH (n:Node)\n",
    "WHERE EXISTS { (n)-[:POTENTIAL_HOST]->()}\n",
    "WITH count{ (n)-[:POTENTIAL_HOST]->() } AS ph\n",
    "RETURN ph, count(*) AS count\n",
    "ORDER BY ph\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "709fc703",
   "metadata": {},
   "source": [
    "18,359 organisms have only one known host, while 163,434 have two known hosts. This is consistent with my hypothesis that most parasites that attack humans are also described as potentially attacking all vertebrates, although the counts alone don't prove it.\n",
    "\n",
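    "The query above aggregates in two steps: it first counts the hosts per organism and then counts how many organisms share each host count. Here is a small pure-Python sketch of the same logic, using made-up `(organism, host)` pairs rather than the real data:\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "# Made-up POTENTIAL_HOST pairs (organism, host), for illustration only\n",
    "edges = [\n",
    "    ('virus A', 'human'),\n",
    "    ('virus A', 'vertebrates'),\n",
    "    ('virus B', 'human'),\n",
    "    ('virus B', 'vertebrates'),\n",
    "    ('virus C', 'plants'),\n",
    "]\n",
    "\n",
    "# Step 1: number of hosts per organism (the WITH clause)\n",
    "hosts_per_organism = Counter(organism for organism, _ in edges)\n",
    "\n",
    "# Step 2: number of organisms per host count (the RETURN clause)\n",
    "distribution = Counter(hosts_per_organism.values())\n",
    "```\n",
    "\n",
    "With these toy pairs, two organisms have two hosts and one organism has a single host.\n",
    "\n",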
    "Here is where inference queries come into play. We know that vertebrates is a higher-level taxon in the taxonomy of organisms. Therefore, we can traverse from vertebrates down to the species level to examine which species could potentially be used as hosts.\n",
    "\n",
    "We will use the example of the Monkeypox virus, as it is topical at the time of writing. First, we can evaluate its potential hosts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "3c3337c6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>host</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>human</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>vertebrates</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          host\n",
       "0        human\n",
       "1  vertebrates"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gds.run_cypher(\"\"\"\n",
    "MATCH (n: Node {name:\"Monkeypox virus\"})-[:POTENTIAL_HOST]->(host)\n",
    "RETURN host.name AS host\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8d98d57",
   "metadata": {},
   "source": [
    "Notice that both human and vertebrates are described as potential hosts of Monkeypox virus. However, let's say we want to examine all the species that are potentially endangered by the virus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "04988b57",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>host</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>human</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Neoceratodus forsteri</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lepidosireniformes sp. BOLD:AAL6055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Protopterus sp. NBE-2020</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Protopterus sp. LMN-2018</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Protopterus sp. BAFEN289-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Protopterus sp. BOLD:AAL6244</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Protopterus sp. DRV-2007</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Protopterus sp. IMCB-2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Protopterus sp.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                  host\n",
       "0                                human\n",
       "1                Neoceratodus forsteri\n",
       "2  Lepidosireniformes sp. BOLD:AAL6055\n",
       "3             Protopterus sp. NBE-2020\n",
       "4             Protopterus sp. LMN-2018\n",
       "5          Protopterus sp. BAFEN289-10\n",
       "6         Protopterus sp. BOLD:AAL6244\n",
       "7             Protopterus sp. DRV-2007\n",
       "8            Protopterus sp. IMCB-2001\n",
       "9                      Protopterus sp."
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gds.run_cypher(\"\"\"\n",
    "MATCH (n: Node {name:\"Monkeypox virus\"})-[:POTENTIAL_HOST]->()<-[:PARENT*0..]-(host:Species)\n",
    "RETURN host.name AS host\n",
    "LIMIT 10\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1355d7a4",
   "metadata": {},
   "source": [
    "We have used a limit as there are a lot of vertebrates. Unfortunately, the dataset doesn't tell us which of them are extinct. Knowing that would let us filter them out and identify only the potential victims of the Monkeypox virus that are still alive. However, it is still an excellent example of inference in Neo4j, where we create or infer a new relationship based on a predefined set of rules at query time."
   ]
  },
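  {
   "cell_type": "markdown",
   "id": "7d3e9a41",
   "metadata": {},
   "source": [
    "The rule behind this inference can also be sketched in plain Python: a host described at a higher taxonomic level stands for every species beneath it. The tiny tree below is hypothetical and heavily simplified (the real lineages contain many more levels), and the function names are my own:\n",
    "\n",
    "```python\n",
    "# Hypothetical, simplified child lists standing in for PARENT relationships\n",
    "children = {\n",
    "    'vertebrates': ['mammals', 'lungfishes'],\n",
    "    'mammals': ['human'],\n",
    "    'lungfishes': ['Neoceratodus forsteri'],\n",
    "}\n",
    "species = {'human', 'Neoceratodus forsteri'}  # nodes carrying the Species label\n",
    "described_hosts = ['human', 'vertebrates']    # hosts listed for the virus\n",
    "\n",
    "def descendants(taxon):\n",
    "    # The taxon itself plus everything below it (zero or more PARENT hops)\n",
    "    yield taxon\n",
    "    for child in children.get(taxon, []):\n",
    "        yield from descendants(child)\n",
    "\n",
    "def inferred_species_hosts(hosts):\n",
    "    # A set removes duplicates, e.g. human reached directly and via vertebrates\n",
    "    return {t for h in hosts for t in descendants(h) if t in species}\n",
    "```\n",
    "\n",
    "Because human is both a described host and a descendant of vertebrates, it is reached twice. The set takes care of that here, while in Cypher the same effect can be achieved with `RETURN DISTINCT host.name`."
   ]
  },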
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0254cf84",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
